
5.6 Outlier Suppression: Pushing the Limit of Low-Bit Transformer Language Models

Wei et al. [243] propose a method that suppresses the outliers in language models, thereby pushing the accuracy of 6-bit post-training quantization (PTQ) and 4-bit quantization-aware training (QAT) on BERT to the full-precision level.

Previous works [17, 165] indicate that Transformer-based models hold significantly large activation outliers (with magnitudes close to 100). Moreover, these extreme outliers appear in structured patterns: they mainly gather at a few embedding dimensions and become even larger on particular tokens. Because such outliers can devastate quantization performance, the existing method [17] resorts to workaround solutions such as a finer quantization granularity. However, finer quantization granularity increases the computation cost and unavoidably hinders the acceleration effect. In contrast, Wei et al. propose to suppress the outliers rather than work around them. They first provide an in-depth analysis of what induces the outliers and of the impact of clipping them.

5.6.1 Analysis

Specifically, the analysis presents two findings: (1) the scaling parameter in LayerNorm amplifies the outliers along certain embedding dimensions, and (2) when the outliers are clipped and the final performance is evaluated, the importance of individual outliers varies greatly. For the first finding, the scaling parameter γ in LayerNorm acts as an outlier amplifier, magnifying the outliers in the output. For token t at the j-th embedding dimension, LayerNorm is defined as follows:

\tilde{X}_{t,j} = \frac{X_{t,j} - \mu_t}{\sqrt{\sigma_t^2 + \epsilon}} \cdot \gamma_j + \beta_j, \qquad (5.13)

where $\mu_t$ and $\sigma_t^2$ are the mean and variance of token t, respectively. From this formula, the multiplier γ plays a crucial part in amplifying the magnitude of token t, as shown in Fig. 5.8. Thus, they propose to remove the amplification effect by extracting γ from Eq. (5.13) and using the Non-scaling LayerNorm of Eq. (5.14):

X'_{t,j} = \frac{X_{t,j} - \mu_t}{\sqrt{\sigma_t^2 + \epsilon}} + \frac{\beta_j}{\gamma_j}, \qquad (5.14)

Since extracting γ shortens the magnitude of token t, the resulting $X'$ is more quantization-friendly than $\tilde{X}$.
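
To make the effect concrete, the following is a minimal PyTorch sketch, not the authors' code: the tensor shapes, the toy γ with one large entry, and the function names are illustrative assumptions. It compares the standard LayerNorm of Eq. (5.13) with the Non-scaling LayerNorm of Eq. (5.14):

import torch

def layernorm(x, gamma, beta, eps=1e-5):
    # Standard LayerNorm, Eq. (5.13): normalize each token, then scale by gamma.
    mu = x.mean(dim=-1, keepdim=True)                    # per-token mean
    var = x.var(dim=-1, unbiased=False, keepdim=True)    # per-token variance
    return (x - mu) / torch.sqrt(var + eps) * gamma + beta

def nonscaling_layernorm(x, gamma, beta, eps=1e-5):
    # Non-scaling LayerNorm, Eq. (5.14): gamma is extracted so it no longer
    # amplifies outlier embedding dimensions; beta is rescaled by gamma.
    mu = x.mean(dim=-1, keepdim=True)
    var = x.var(dim=-1, unbiased=False, keepdim=True)
    return (x - mu) / torch.sqrt(var + eps) + beta / gamma

# Toy comparison: a gamma with one large entry mimics the outlier amplifier.
x = torch.randn(4, 8)                    # 4 tokens, 8 embedding dimensions
gamma = torch.ones(8); gamma[3] = 30.0   # hypothetical outlier dimension
beta = torch.zeros(8)
print(layernorm(x, gamma, beta).abs().max())             # large dynamic range
print(nonscaling_layernorm(x, gamma, beta).abs().max())  # much smaller range

In the full method, the extracted γ is migrated into the subsequent modules so that the network's function is preserved; the sketch above only illustrates why the non-scaling output has a smaller dynamic range and is therefore easier to quantize.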

For the second finding, they discover that the impact on final performance of clipping the outliers varies greatly from outlier to outlier. Take the outliers after GELU as an example. Fig. 5.9 shows that sharply clipping the more aggressive outliers (clipping signals in the 10-100 range down to 10) does not even hurt the full-precision performance, with accuracy remaining at 91.02. At the same time, the accuracy drops suddenly to 85.93 when too many outliers are cut. In addition, although the less important outliers present in a long-tail form, they are provided by only a few tokens; in particular, the unimportant outliers that can be clipped without any accuracy drop in FP models correspond to only a few tokens. The red points in Fig. 5.9, which represent the proportion of clipped tokens, clearly show that although the more aggressive outliers occupy a large range from 10 to 100, they match only about 3% of tokens. Destroying those sharper outliers belonging to a few tokens does not affect the performance.
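
As a rough illustration of this token-level statistic, the following sketch clips post-GELU activations at a threshold and measures the fraction of tokens that contain any clipped value. The helper name clip_and_token_ratio, the threshold of 10, and the toy activation tensor are hypothetical assumptions, not the paper's actual setup:

import torch

def clip_and_token_ratio(acts, threshold):
    # acts: (num_tokens, hidden_dim) post-GELU activations.
    clipped = acts.clamp(max=threshold)          # clip aggressive outliers
    affected = (acts > threshold).any(dim=-1)    # tokens holding clipped signals
    return clipped, affected.float().mean().item()

# Toy data: mostly small activations, a few tokens with outliers in 10-100.
acts = torch.rand(1000, 768) * 2.0
acts[:30, 5] = torch.rand(30) * 90 + 10          # ~3% of tokens carry outliers
_, ratio = clip_and_token_ratio(acts, threshold=10.0)
print(f"fraction of tokens with clipped outliers: {ratio:.3f}")  # ~0.030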